Self-supervised automated wrapper generation for weblog data extraction
Data extraction from the web is notoriously hard. Of the types of resources available on the web, weblogs are becoming increasingly important due to the continued growth of the blogosphere, but remain poorly explored. Past approaches to data extraction from weblogs have often involved manual intervention and suffer from low scalability. This paper proposes a fully automated information extraction methodology based on the use of web feeds and processing of HTML. The approach includes a model for generating a wrapper that exploits web feeds for deriving a set of extraction rules automatically. Instead of performing a pairwise comparison between posts, the model matches the values of the web feeds against their corresponding HTML elements retrieved from multiple weblog posts. It adopts a probabilistic approach for deriving a set of rules and automating the process of wrapper generation. An evaluation of the model is conducted on a dataset of 2,393 posts and the results (92% accuracy) show that the proposed technique enables robust extraction of weblog properties and can be applied across the blogosphere for applications such as improved information retrieval and more robust web preservation initiatives.
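The feed-matching idea in this abstract can be illustrated with a minimal sketch. It is not the authors' model: the class `TextByPath`, the function `derive_rules`, the tag-plus-class path representation, and the simple frequency score are all assumptions made here for illustration. Each feed value is matched against the text of HTML elements in its post, and the element path that matches most often across posts is kept as the rule for that property.

```python
from collections import Counter
from html.parser import HTMLParser


class TextByPath(HTMLParser):
    """Collect (path, text) pairs; a path is a tuple of 'tag.class' steps (hypothetical format)."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.found = []

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class") or ""
        self.stack.append(f"{tag}.{cls}" if cls else tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unclosed elements.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split(".")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.found.append((tuple(self.stack), text))


def derive_rules(posts):
    """posts: list of (feed_values, html). Returns {property: (path, share of posts matched)}."""
    votes = {}
    for feed_values, html in posts:
        parser = TextByPath()
        parser.feed(html)
        parser.close()
        for prop, value in feed_values.items():
            counter = votes.setdefault(prop, Counter())
            # Count each matching path at most once per post.
            for path in {p for p, text in parser.found if text == value.strip()}:
                counter[path] += 1
    rules = {}
    for prop, counter in votes.items():
        if counter:
            path, hits = counter.most_common(1)[0]
            rules[prop] = (path, hits / len(posts))
    return rules
```

Calling `derive_rules` on a handful of posts from one weblog yields one candidate path per feed property together with the share of posts in which it matched. A real system would need fuzzy matching and a more careful scoring scheme than this exact-match count.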
Information Extraction in Illicit Domains
Extracting useful entities and attribute values from illicit domains such as
human trafficking is a challenging problem with the potential for widespread
social impact. Such domains employ atypical language models, have `long tails'
and suffer from the problem of concept drift. In this paper, we propose a
lightweight, feature-agnostic Information Extraction (IE) paradigm specifically
designed for such domains. Our approach uses raw, unlabeled text from an
initial corpus, and a few (12-120) seed annotations per domain-specific
attribute, to learn robust IE models for unobserved pages and websites.
Empirically, we demonstrate that our approach can outperform feature-centric
Conditional Random Field baselines by over 18% F-Measure on five annotated
sets of real-world human trafficking datasets in both low-supervision and
high-supervision settings. We also show that our approach is demonstrably
robust to concept drift, and can be efficiently bootstrapped even in a serial
computing environment. Comment: 10 pages, ACM WWW 201
Intelligent Self-Repairable Web Wrappers
The amount of information available on the Web grows at an incredibly high rate. Systems and procedures devised to extract these data from Web sources already exist, and different approaches and techniques have been investigated in recent years. On the one hand, reliable solutions should provide robust Web data mining algorithms that can automatically handle possible malfunctions or failures. On the other, the literature lacks solutions for the maintenance of these systems. Procedures that extract Web data may be tightly coupled to the structure of the data source itself; thus, malfunctions or the acquisition of corrupted data could be caused, for example, by structural modifications made to data sources by their owners. At present, verification of data integrity and maintenance are mostly handled manually, in order to ensure that these systems work correctly and reliably. In this paper we propose a novel approach to create procedures able to extract data from Web sources (the so-called Web wrappers) which can cope with malfunctions caused by modifications of the structure of the data source, and can automatically repair themselves.
An Automated Algorithm for Extracting Website Skeleton
The huge amount of information available on the Web has attracted many research efforts into developing wrappers that extract data from webpages. However, as most of the systems for generating wrappers focus on extracting data at page-level, data extraction at site-level remains a manual or semi-automatic process. In this paper, we study the problem of extracting website skeleton, i.e. extracting the underlying hyperlink structure that is used to organize the content pages in a given website. We propose an automated algorithm, called the Sew algorithm, to discover the skeleton of a website. Given a page, the algorithm examines hyperlinks in groups and identifies the navigation links that point to pages in the next level in the website structure. The entire skeleton is then constructed by recursively fetching pages pointed to by the discovered links and analyzing these pages using the same process. Our experiments on real-life websites show that the algorithm achieves a high recall with moderate precision.
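The recursive construction described in this abstract can be sketched as follows. This is only an illustration under stated assumptions, not the Sew algorithm itself: the abstract does not say how navigation links are identified, so `pick_navigation_group` uses a placeholder heuristic (the group with the most not-yet-seen links), and the in-memory `site` mapping from a URL to its groups of links stands in for fetching and parsing real pages. Both function names are hypothetical.

```python
def pick_navigation_group(groups, seen):
    """Placeholder heuristic (an assumption): pick the group with the most unseen links."""
    best = []
    for group in groups:
        fresh = [url for url in group if url not in seen]
        if len(fresh) > len(best):
            best = fresh
    return best


def build_skeleton(url, site, seen=None, max_depth=5):
    """Return a nested dict {url: {child_url: {...}}} of discovered navigation links.

    site maps a URL to a list of hyperlink groups (each group is a list of URLs).
    """
    if seen is None:
        seen = {url}
    children = {}
    if max_depth > 0:
        nav = pick_navigation_group(site.get(url, []), seen)
        seen.update(nav)
        # Recurse into each navigation link using the same process.
        for child in nav:
            children[child] = build_skeleton(child, site, seen, max_depth - 1)[child]
    return {url: children}
```

The shared `seen` set keeps links back to already visited pages (such as the home page) out of deeper levels, and `max_depth` bounds the recursion on large sites.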
Electronic and optical properties of electromigrated molecular junctions
Electromigrated nanoscale junctions have proven very useful for studying
electronic transport at the single-molecule scale. However, confirming that
conduction is through precisely the molecule of interest and not some
contaminant or metal nanoparticle has remained a persistent challenge,
typically requiring a statistical analysis of many devices. We review how
transport mechanisms in both purely electronic and optical measurements can be
used to infer information about the nanoscale junction configuration. The
electronic response to optical excitation is particularly revealing. We briefly
discuss surface-enhanced Raman spectroscopy on such junctions, and present new
results showing that currents due to optical rectification can provide a means
of estimating the local electric field at the junction due to illumination. Comment: 19 pages, 8 figures, invited paper for forthcoming special issue of
Journal of Physics: Condensed Matter. For other related papers, see
http://www.ruf.rice.edu/~natelson/publications.htm
Large-Scale Atomistic Simulations of Environmental Effects on the Formation and Properties of Molecular Junctions
Using an updated simulation tool, we examine molecular junctions comprised of
benzene-1,4-dithiolate bonded between gold nanotips, focusing on the importance
of environmental factors and inter-electrode distance on the formation and
structure of bridged molecules. We investigate the complex relationship between
monolayer density and tip separation, finding that the formation of
multi-molecule junctions is favored at low monolayer density, while
single-molecule junctions are favored at high density. We demonstrate that tip
geometry and monolayer interactions, two factors that are often neglected in
simulation, affect the bonding geometry and tilt angle of bridged molecules. We
further show that the structures of bridged molecules at 298 and 77 K are
similar. Comment: To appear in ACS Nano, 30 pages, 5 figure
Logic, Probability and Action: A Situation Calculus Perspective
The unification of logic and probability is a long-standing concern in AI,
and more generally, in the philosophy of science. In essence, logic provides an
easy way to specify properties that must hold in every possible world, and
probability allows us to further quantify the weight and ratio of the worlds
that must satisfy a property. To that end, numerous developments have been
undertaken, culminating in proposals such as probabilistic relational models.
While this progress has been notable, a general-purpose first-order knowledge
representation language to reason about probabilities and dynamics, including
in continuous settings, is still to emerge. In this paper, we survey recent
results pertaining to the integration of logic, probability and actions in the
situation calculus, which is arguably one of the oldest and most well-known
formalisms. We then explore reduction theorems and programming interfaces for
the language. These results are motivated in the context of cognitive robotics
(as envisioned by Reiter and his colleagues) for the sake of concreteness.
Overall, the advantage of proving results for such a general language is that
it becomes possible to adapt them to any special-purpose fragment, including
but not limited to popular probabilistic relational models.
Developmental profile of localized spontaneous Ca2+ release events in the dendrites of rat hippocampal pyramidal neurons
Author Posting. © The Author(s), 2012. This is the author's version of the work. It is posted here by permission of Elsevier B.V. for personal use, not for redistribution. The definitive version was published in Cell Calcium 52 (2012): 422-432, doi:10.1016/j.ceca.2012.08.001. Recent experiments demonstrate that localized spontaneous Ca2+ release events can be detected in the dendrites of pyramidal cells in the hippocampus and other neurons (J. Neurosci. 29:7833-7845, 2009). These events have some properties that resemble ryanodine receptor mediated “sparks” in myocytes, and some that resemble IP3 receptor mediated “puffs” in oocytes. They can be detected in the dendrites of rats of all tested ages between P3 and P80 (with sparser sampling in older rats), suggesting that they serve a general signaling function and are not just important in development. However, in younger rats the amplitudes of the events are larger than the amplitudes in older animals and almost as large as the amplitudes of Ca2+ signals from backpropagating action potentials (bAPs). The rise time of the event signal is fast at all ages and is comparable to the rise time of the bAP fluorescence signal at the same dendritic location. The decay time is slower in younger animals, primarily because of weaker Ca2+ extrusion mechanisms at that age. Diffusion away from a brief localized source is the major determinant of decay at all ages. A simple computational model closely simulates these events with extrusion rate the only age dependent variable. Supported in part by NIH grant NS-016295
Green function techniques in the treatment of quantum transport at the molecular scale
The theoretical investigation of charge (and spin) transport at nanometer
length scales requires the use of advanced and powerful techniques able to deal
with the dynamical properties of the relevant physical systems, to explicitly
include out-of-equilibrium situations typical for electrical/heat transport as
well as to take into account interaction effects in a systematic way.
Equilibrium Green function techniques and their extension to non-equilibrium
situations via the Keldysh formalism build one of the pillars of current
state-of-the-art approaches to quantum transport which have been implemented in
both model Hamiltonian formulations and first-principle methodologies. We offer
a tutorial overview of the applications of Green functions to deal with some
fundamental aspects of charge transport at the nanoscale, mainly focusing on
applications to model Hamiltonian formulations. Comment: Tutorial review, LaTeX, 129 pages, 41 figures, 300 references,
submitted to Springer series "Lecture Notes in Physics